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Abstract 

The  researchers  made  significant  progress  in  all  of  the  proposed  research  areas.  The 
first  major  task  in  the  proposal  involved  simulation-based  and  sampling  methods  for  global 
optimization.  In  support  of  this  task,  we  have  discovered  two  new  innovative  approaches 
to  simulation-based  global  optimization;  the  first  involves  connections  between  stochastic 
approximation  and  our  model  reference  approach  to  global  optimization,  while  the  second 
connects  particle  filtering  and  simulation-based  approaches  to  global  optimization.  We  have 
also  made  significant  progress  in  population-based  global  optimal  search  methods,  applica¬ 
tions  of  these  algorithms  to  problems  in  statistics  and  clinical  trials,  and  efficient  allocation 
of  simulations. 

In  support  of  the  second  task,  we  have  made  progress  incorporating  simulation-based 
and  sampling  methods  into  Markov  Decision  Processes  (MDPs).  We  have  made  signifi¬ 
cant  progress  on  new  sampling  methods  for  MDPs,  simulation-based  approaches  to  partially 
observable  Markov  decision  processes  (POMDPs),  and  applications  of  these  algorithms. 
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1  Introduction 

In  this  research  project,  we  proposed  to  investigate  basic  questions  aimed  at  challenges  in 
information  superiority,  logistics,  and  planning  for  the  Air  Force  of  the  future.  In  particular, 
we  proposed  to  investigate  simulation-based  methodologies  for  global  optimization  and  plan¬ 
ning  that  can  be  effective  tools  in  an  integrated  approach  to  Global  Awareness  (Intelligence, 
Surveillance  and  Reconnaissance,  or  ISR),  Command  and  Control,  planning,  and  logistics. 

Such  systems  are  exceedingly  complex,  and  we  combined  four  approaches  in  the  study  of 
such  problems: 

•  Developing  and  studying  efficient  simulation-based  and  sampling  methodologies  for 
global  optimization  problems; 

•  Studying  the  application  of  these  global  optimization  methodologies  to  practical  prob¬ 
lems,  such  as  those  arising  in  planning  for  unmanned  aerial  vehicles  and  data  mining; 

•  Developing  and  studying  efficient  simulation-based  and  sampling  methodologies  for 
problems  of  dynamic  decision  making  under  uncertainty; 

•  Studying  the  application  of  these  dynamic  optimization  methodologies  to  practical 
problems,  such  as  inventory  control,  preventive  maintenance,  and  optimal  stopping. 

1.1  Simulation-Based  and  Sampling  Methods  for  Global  Opti¬ 
mization 

Simulation  is  used  to  model  complex  stochastic  systems  arising  in  applications  from  supply 
chain  management  to  financial  engineering  to  telecommunications,  among  many  others.  In 
addition  to  performance  evaluation,  optimization  —  or  at  least  improvement  —  of  the  system 
is  clearly  a  desirable  objective. 

Following  [53],  we  distinguish  between  instance-based  and  model-based  global  optimiza¬ 
tion  solution  methods.  In  instance-based  methods,  the  search  for  new  candidate  solutions 
depends  directly  on  previously  generated  solutions,  e.g.,  simulated  annealing  [33],  genetic 
algorithms  (GAs)  [18],  tabu  search  [17],  and  nested  partitions  [41],  On  the  other  hand,  in 
model-based  algorithms,  new  candidate  solutions  are  generated  via  an  intermediate  prob¬ 
ability  model  that  is  iteratively  updated.  Our  research  has  focused  on  the  model-based 
optimization  framework,  which  involves  the  following  ingredients: 

(0)  specify  probability  distribution  over  solution  space; 

(I)  generate  candidate  solutions  by  sampling  from  distribution; 

(II)  estimate  performance  of  (and  possibly  improve)  candidate  solutions; 

(III)  update  distribution  based  on  selected  ( “elite” )  set  of  candidate  solutions. 

This  approach  retains  the  primary  strengths  of  population-based  approaches  such  as  ge¬ 
netic  algorithms  —  improving  upon  simulated  annealing,  which  works  with  a  single  iterate 
at  a  time,  while  at  the  same  time  providing  more  flexibility  in  exploring  the  entire  solution 
space,  introducing  more  structure  in  the  search  procedure,  and  allowing  theoretical  prop¬ 
erties  to  be  studied  regarding  both  finite-time  performance  and  asymptotic  convergence. 
The  theory  behind  the  framework  is  rigorous,  but  based  on  an  idealized  version  of  the  last 
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three  ingredients,  specifically  the  distribution  sequence,  sampling  from  the  distribution  se¬ 
quence  (or  from  a  surrogate  or  approximation),  and  estimation  of  the  performance,  since  it 
is  observed  through  simulation.  Schematically,  we  seek  a  sequence  of  distributions 

9o,  9i,92,  •••  — »  9oo, 

where  concentrates  its  mass  around  the  optimal  solutions. 

Examples  for  the  sequence  of  distributions  {gk}  include  the  following: 

(a)  proportional  selection  scheme  —  introduced  in  estimation  of  distribution  algorithms 
(EDAs)  [48],  and  the  instantiation  of  our  model  reference  adaptive  search  (MRAS) 
method  in  [25]; 

(b)  Boltzmann  distribution  with  decreasing  (nonincreasing)  temperature  schedule  —  used 
in  the  annealing  adaptive  search  (AAS)  [45,  40]; 

(c)  optimal  importance  sampling  measure  —  used  in  the  cross-entropy  (CE)  method  [13]. 

Case  (c)  is  easiest  to  implement,  but  may  not  converge  to  a  global  optimum.  The  other  two 
cases  have  nice  theoretical  properties:  Case  (a)  guarantees  improvement  in  expectation  [25] , 
whereas  case  (b)  guarantees  improvement  in  stochastic  order  [45].  All  three  will  be  used  in 
the  proposed  research. 

However,  in  all  three  cases,  the  sequence  {gk}  is  unknown  explicitly  a  priori,  or  else  the 
problem  would  essentially  be  solved.  So  at  iteration  k,  sampling  is  done  from  a  surrogate  dis¬ 
tribution^)  that  approximates  gk-  There  are  two  main  approaches  that  have  been  adopted: 

•  Markov  chain  Monte  Carlo  approximation  to  the  target  distribution  at  each  iteration 
[45],  which  involves  generating  a  sequence  of  sample  points  from  a  sequence  of  dis¬ 
tributions  {Tiki}  following  a  Markov  chain  that  asymptotically  converge  to  the  target 
distribution  (at  that  iteration),  i.e., 

Til,  Tc2,  •••  — »  gk] 

e.g.,  a  common  implementation  is  the  “hit-and-run”  algorithm  [42,  46,  47]; 

•  projection  onto  distributions  that  are  easy  to  work  with,  e.g.,  use  a  family  of  param¬ 
eterized  distributions  {fe},  and  project  gk  onto  the  family  to  obtain  a  sequence  that 
converges  to  the  (final)  target  distribution,  i.e., 


fs o ,  fOi,  fo 2,  •••  *  9oo , 

a  common  implementation  minimizes  the  Kullback-Leibler  (KL)  divergence  between 
fgk  and  gk  at  each  iteration,  because  it  leads  to  analytically  tractable  solutions  if  the 
parameterized  distributions  are  from  the  exponential  family. 

The  first  approach  is  adopted  by  AAS,  and  generates  a  sequence  of  candidate  solutions  at 
each  iteration.  Our  MRAS  method  and  the  CE  method  follow  the  second  approach,  which 
leads  to  a  population  of  candidate  solutions,  from  which  an  elite  set  is  selected  and  used  to 
update  the  distribution. 
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1.2  Simulation-Based  and  Sampling  Methods  for  Markov  Decision 
Processes 

Simulation  optimization  problems  arising  in  supply  chain  management,  path  planning  for 
unmanned  aerial  vehicles,  financial  engineering,  and  telecommunications  are  characterized 
by  two  critical  aspects:  changing  dynamics  and  stochastic  events.  For  example,  effective 
supply  chain  management  requires  optimal  responsive  actions  in  the  face  of  both  gradual 
shifts  in  demand  patterns  -  e.g.,  due  to  technology  advances  -  and  sudden  unpredictable 
disruptions  in  production  capacity  -  e.g.,  clue  to  an  unanticipated  manufacturing  facility 
shutdown.  Such  systems  often  require  computationally  expensive  simulation  models  for  per¬ 
formance  estimation,  such  as  modeling  the  operations  of  an  entire  semiconductor  fabrication 
facility,  where  simulation  runtime  is  typically  on  the  order  of  hours.  Markov  decision  pro¬ 
cesses  (MDPs)  provide  a  powerful  paradigm  for  modeling  optimal  decision  making  under 
uncertainty  in  these  settings,  but  MDPs  suffer  from  the  well-known  curse  of  dimensionality, 
which  can  include  exponential  growth  in  the  size  of  state  spaces  and  action  spaces  with  the 
problem  size;  thus,  direct  numerical  solution  of  MDPs  for  large-scale  real-world  problems  is 
not  currently  feasible.  In  general,  heuristics  and  approximations  are  employed  to  simplify  the 
MDP  model.  Perhaps  the  most  successful  examples  of  this  approach  has  been  approximate 
dynamic  programming  using  value  function  approximation  (cf.  [2,  12,  37,  39,  44]),  but  the 
size  of  problems  that  can  be  solved  remains  relatively  small  compared  to  large-scale  real- 
world  problems  of  interest.  The  approaches  studied  in  this  research  are  meant  to  complement 
these  highly  successful  techniques. 

2  Research  Results 

2.1  Simulation-Based  and  Randomized  Methods  for  Global  Opti¬ 
mization 

Consider  finding  the  optimal  solution  to  the  problem  of  the  form: 

x*  G  arg  min  H (x) ,  (1) 

x&L 

where  X  C  is  the  solution  space,  which  can  be  either  continuous  or  discrete,  and  H(-)  : 
X  — >  3?  is  a  bounded,  deterministic  function.  We  assume  the  existence  of  the  optimal 
solution  x*.  However,  the  function  H  is  not  necessarily  convex  or  even  continuous,  and 
there  could  be  multiple  local  minima.  Note  that  in  a  more  general  stochastic  setting,  the 
objective  function  H  itself  may  take  the  form  of  an  expected  value  of  a  sample  performance 
h,  H{x )  =  E[h(x,ij))],  where  if  is  a  random  variable  (possibly  depending  on  x)  representing 
the  stochastic  effects  of  the  system,  and  only  estimates  of  noisy  function  h  are  available. 

The  theoretical  properties  and  practical  efficiencies  of  model-based  methods  are  primarily 
determined  by  the  two  key  questions  of  how  to  update  the  probability  models  and  how  to 
efficiently  generate  candidate  solutions  from  them.  In  [25],  the  Pis  have  introduced  a  general 
model-based  randomized  search  framework  called  model  reference  adaptive  search  (MRAS), 
where  these  difficulties  are  circumvented  by  sampling  candidate  solutions  from  a  family  of 
parameterized  distributions  {fe(-),9  G  0}  (henceforth  referred  to  as  sampling  distributions) 
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and  using  a  sequence  of  intermediate  reference  distributions  {gk}  to  facilitate  and  guide  the 
updating  of  the  parameters  associated  with  the  parameterized  family.  The  idea  of  projecting 
to  the  exponential  family  of  distributions  was  first  introduced  by  the  CE  method;  see  [25] 
for  the  differences  between  MRAS  and  the  CE  method.  The  idea  is  that  the  parameterized 
family  is  specified  with  some  structure  so  that  once  the  parameter  is  determined,  sampling 
from  each  of  these  distributions  should  be  a  relatively  easy  task.  An  additional  advantage  of 
using  the  parameterized  family  is  that  the  task  of  updating  the  entire  distribution  simplifies 
to  the  task  of  updating  its  associated  parameters.  In  MRAS,  the  parameter  updating  is 
carried  out  by  minimizing  the  Kullback-Leibler  (KL)  divergence  between  the  parameterized 
family  and  the  reference  distributions  {gk}- 

The  sequence  {gk}  is  primarily  used  to  express  the  desired  properties  of  the  method. 
Thus,  these  reference  distributions  are  often  selected  so  that  they  can  be  shown  to  converge 
to  a  degenerate  distribution  with  all  probability  mass  concentrated  on  the  set  of  optimal 
solutions.  Intuitively,  the  sampling  distribution  fgk  can  be  viewed  as  a  compact  approxima¬ 
tion  of  the  reference  distribution  gk  (i.e.,  the  projection  of  the  reference  distribution  onto  the 
parameterized  family).  Thus,  the  hope  is  that  the  sequence  of  sampling  distributions  retain 
the  convergence  properties  of  the  sequence  of  reference  distributions  {gk}-  The  key  steps  in 
each  iteration  (k)  of  MRAS  are  the  following: 

1.  Given  parameter  6k,  generate  a  set  of  N  candidate  solutions  A*,  =  {Xl, . . . ,  Xjf  }  by 
sampling  from  fgk(-),  and  obtain  their  objective  function  values  H(x)  \/x  G  Afc. 

2.  Determine  a  set  of  “elite”  candidate  solutions  A eiite  C  Ak  (e.g.,  the  set  of  solutions 
with  the  best  performance  function  values). 

3.  Update  the  parameter  9k+ 1  based  on  A eute  by  minimizing  the  KL  divergence: 


6k+ 1  =  arg  ruin  @ (y/fc+i ,  fo), 
eee 


@(9,f)  ■=  I  ln77 -laidx), 

J  x£\  J\X) 

and  set  k  <—  k  +  1. 

Step  2  above  is  in  the  same  spirit  as  the  selection  scheme  used  in  many  population-based 
approaches  such  as  genetic  algorithms. 

2.1.1  Model  Reference  Framework  for  Natural  Exponential  Families  (NEFs) 

For  the  exponential  family  of  distributions,  the  minimization  in  step  3  of  the  basic  algorithm 
can  be  carried  out  in  analytical  form,  which  makes  MRAS  also  easy  to  implement  efficiently. 

Definition:  A  parameterized  family  {fg(-),  6  e  0  C  on  X  is  called  a  natural  exponential 
family  if  there  exist  functions  T  :  and  K  :  ?R.m  — >  9R  such  that  fo{x)  = 

exp  (dTT(a;)  —  K{6)},  where  K{6)  =  In /xexp(d7T(;r))z/(<ia;),  and  u  is  the  Lebesgue/ discrete 
measure  on  X. 

The  function  K{6)  plays  an  important  role  in  the  theory  of  natural  exponential  families.  It 
is  strictly  convex  with  VK(9)  =  Ee[r(X)]  and  Hessian  matrix  Cove[r(A)].  Therefore,  the 
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Jacobian  of  the  mean  vector  function  m{6)  :=  ^[r(X)]  is  strictly  positive  definite  and  thus 
invertible.  From  the  inverse  function  theorem,  it  follows  that  m{6)  is  also  invertible. 

When  NEFs  are  used  in  the  framework  with  the  sample  size  N  adaptively  increasing  with 
rate  7  >  1,  the  following  result  [25]  establishes  the  convergence  of  the  sequence  of  parameters 
{9k}  to  the  optimal  parameter  6*. 

Theorem  1  Let  (3  >  0  be  a  constant  such  that  the  set  {x  :  S(H(x ))  ^  has  a  strictly 
positive  Lebesgue/ counting  measure.  If  7  >  (/3S(H(x*)))2,  then  lim^oo  6k  =  m-1(r(a;*)) 

w.p.l. 

This  theorem,  when  restricted  to  many  special  cases,  implies  the  convergence  of  the 
sequence  of  sampling  distributions  {fek(-)}  to  a  degenerate  distribution  on  the  set  of  optimal 
solutions.  For  example,  for  continuous  optimization  problems,  if  the  multivariate  Gaussian 
densities  A f(pk,  E&)  with  mean  vector  pk  and  covariance  matrix  E^  are  used  as  parameterized 
distributions,  then  a  simple  interpretation  of  the  theorem  gives 

lim  pk  =  x*  and  lim  E*,  =  0  w.p.l. 

k — >00  k — >00 


2.1.2  Extensions 

The  MRAS  methodology  is  extended  in  [26,  24]  to  an  algorithm  called  Stochastic  Model 
Reference  Adaptive  Search  (SMRAS)  for  Ending  the  global  optimal  solution  to  stochastic 
optimization  problems,  for  situations  in  which  the  objective  function  cannot  be  evaluated 
exactly,  but  can  be  estimated  with  some  noise  (e.g.,  via  simulation).  We  prove  that  SMRAS 
converges  asymptotically  to  a  global  optimal  solution  with  probability  one  for  both  stochastic 
continuous  and  discrete  problems.  Numerical  studies  have  been  carried  out  to  validate  the 
method. 

We  have  made  the  important  discovery  that  there  exists  a  close  relationship  between  the 
model-reference  framework  and  the  well-known  stochastic  approximation  method  [28].  This 
relationship  explains  why  these  model-based  algorithms  work  well  for  hard  optimization 
problems  with  little  structure  -  by  implicitly  converting  and  transforming  the  underlying 
problem  into  an  equivalent  problem  on  the  parameter  space  with  smooth  differential  struc¬ 
tures  -  and  any  model-based  algorithm  that  can  be  accommodated  by  the  framework  can 
essentially  be  viewed  as  a  gradient-based  recursion  on  the  parameter  space  for  solving  the 
equivalent  smoothed  problem.  This  new  discovery  provides  a  unifying  framework  for  analyz¬ 
ing  the  convergence  and  convergence  rate  behavior  of  model-based  algorithms,  in  the  sense 
that  the  available  tools  and  results  of  stochastic  approximation  from  over  half  a  century 
can  be  applied  to  study  model-based  algorithms  for  general  non- differentiable  optimization 
problems.  Moreover,  this  equivalence  relationship  will  also  help  us  to  understand  the  ca¬ 
pability  and  limitation  of  model-based  algorithms,  and  provide  insight  into  designing  new 
algorithm  instantiations.  Our  empirical  implementation  of  model-based  algorithms  based  on 
a  pure  gradient  interpretation  indicates  significant  performance  improvement  over  their  orig¬ 
inal  versions.  We  believe  that  this  new  direction  will  eventually  lead  to  robust  and  efficient 
new  simulation  and  sampling-based  techniques  capable  of  handling  complex  optimization 
problem  involving  hundreds  of  decision  variables. 
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Another  important  aspect  we  have  worked  on  is  the  simulation/sampling  efficiency  issue 
in  model-based  algorithms.  This  is  motivated  by  the  fact  that  many  model-based  algorithms 
have  a  demanding  computational  requirement  per  iteration,  since  a  sufficient  number  of  can¬ 
didate  solutions  need  to  be  collected  to  update  the  probability  models  over  the  solution  space. 
Efficient  simulation  and  sampling  procedures  in  model-based  algorithms  will  help  to  capture 
reliable  information  in  updating  probability  models  with  only  a  modest  computational  effort, 
and  thus  further  enhance  the  value  of  such  algorithms.  Our  approach  is  based  on  model¬ 
ing  the  simulation  and  sampling  process  in  a  typical  model-based  algorithm  as  a  sequential 
decision  making  problem,  where  the  ultimate  goal  is  to  maximize  algorithm  performance 
subject  to  a  given  computational  budget  constraint.  Some  theoretical  and  empirical  findings 
are  contained  in  [22],  Our  results  show  that  for  high- dimensional  optimization  problems, 
the  proposed  computing  budget  allocation  scheme  could  yield  orders  of  magnitude  savings 
in  computational  effort. 

In  [52],  [50]  we  have  proposed  an  innovative  new  framework  for  optimization  problems 
based  on  particle  filtering  (also  called  Sequential  Monte  Carlo  method).  This  framework 
unifies  and  provides  new  insight  into  randomized  optimization  algorithms.  The  framework 
also  sheds  light  on  developing  new  optimization  algorithms  through  the  freedom  in  the 
framework  and  the  various  techniques  for  improving  particle  filtering. 

The  paper  [23]  considers  a  population-based  Model-based  Search  (MBS),  where  a  pop¬ 
ulation  of  probabilistic  models  is  maintained/updated  and  subsequently  propagated  from 
generation  to  generation.  Unlike  traditional  MBS,  one  of  the  key  questions  in  the  proposed 
approach  is  how  to  efficiently  distribute  a  given  sample  budget  among  different  models  in 
a  population  to  maximize  the  algorithm  performance.  We  formulate  this  problem  as  a 
MDP  model  and  derive  an  optimal  sampling  scheme  to  adaptively  allocate  computational 
resources.  In  particular,  the  proposed  sampling  scheme  assigns  to  each  model  a  performance 
index  to  determine  the  quality  of  the  model  and  samples  the  one  that  has  the  current  best 
index.  These  performance  indices  are  then  further  used  in  conjunction  with  a  variant  of 
the  recently  proposed  cross-entropy  (CE)  method  to  update  the  current  population  to  pro¬ 
duce  an  improving  population  of  models.  We  carry  out  numerical  studies  to  illustrate  the 
algorithm  and  compare  its  performance  with  existing  procedures. 

2.1.3  Applications 

We  have  studied  an  application  in  data  mining,  specifically  model-based  cluster  analysis 
involving  maximum  likelihood  estimation.  For  large  data  sets  with  many  clusters,  the  re¬ 
sulting  log-likelihood  function  has  many  local  optima,  so  traditional  statistical  techniques 
such  as  the  popular  Expectation-Maximization  (EM)  algorithm  often  fail  to  find  the  global 
maximum.  We  have  recently  applied  our  MRAS  method  clustering  problems  [20]  and  com¬ 
pared  it  with  the  EM  method.  Although  the  EM  method  is  faster  in  converging  to  a  local 
optimum,  for  problems  with  many  local  optima,  it  often  fails  to  find  the  global  optimum 
found  by  the  other  two  methods,  even  with  many  restarts. 

We  have  also  made  considerable  advances  in  applying  our  sampling  and  model-based 
framework  and  related  techniques  to  solving  statistical  problems.  For  example,  we  have 
proposed  [29]  two  adaptive  resampling  algorithms  for  estimating  bootstrap  distributions. 
One  algorithm  applies  the  CE  method  and  does  not  require  calculation  of  the  resampling 
probability  weights  via  numerical  optimization  methods  (e.g.,  Newton’s  method),  whereas 


7 


the  other  algorithm  can  be  viewed  as  a  multi-stage  extension  of  the  classical  two-step  vari¬ 
ance  minimization  approach.  The  two  algorithms  can  be  easily  used  as  part  of  a  general 
algorithm  for  Monte  Carlo  calculation  of  bootstrap  confidence  intervals  and  tests,  and  is 
especially  useful  in  estimating  rare  event  probabilities.  We  analyze  theoretical  properties  of 
both  algorithms  in  an  idealized  setting  and  carry  out  simulation  studies  to  demonstrate  their 
performance.  In  [30],  we  propose  an  adaptive  importance  resampling  algorithm  for  estimat¬ 
ing  bootstrap  quantiles  of  general  statistics.  The  algorithm  is  especially  useful  in  estimating 
extreme  quantiles  and  can  be  easily  used  to  construct  bootstrap  confidence  intervals.  Em¬ 
pirical  results  on  real  and  simulated  data  sets  show  that  the  proposed  algorithm  is  not  only 
superior  to  the  uniform  resampling  approach,  but  may  also  provide  more  than  an  order  of 
magnitude  of  computational  efficiency  gains.  We  have  introduced  [43]  a  multi-step  variance 
minimization  algorithm  for  numerical  estimation  of  Type  I  and  Type  II  error  probabilities 
in  sequential  tests.  The  algorithm  can  be  applied  to  general  test  statistics  and  easily  built 
into  general  design  algorithms  for  sequential  tests.  Our  simulation  results  indicate  that  the 
proposed  algorithm  is  particularly  useful  for  estimating  tail  probabilities,  and  may  lead  to 
significant  computational  efficiency  gains  over  the  crude  Monte  Carlo  method. 

Many  clinical  trials  involve  a  sequential  stopping  rule  to  specify  the  conditions  under 
which  a  study  might  be  terminated  earlier  before  its  planned  completion.  The  most  impor¬ 
tant  issue  in  the  design  stage  is  to  determine  the  common  operating  characteristics  such  as 
Type  I  and  Type  II  error  rates,  for  which  crude  Monte  Carlo  simulation  methods  are  widely 
adopted.  However,  it  is  well  known  that  crude  Monte  Carlo  may  lead  to  large  variabilities  in 
resultant  estimates  and  excessive  waste  of  computational  resources.  In  [31],  we  propose  an 
efficient  importance  sampling  approach  for  determining  type  I  and  type  II  error  rates  in  both 
fully  sequential  and  group  sequential  clinical  trial  designs  with  either  immediate  responses 
or  survival  endpoints.  The  approach  is  insensitive  to  the  underlying  statistics  of  interest, 
and  can  be  easily  built  into  a  general  algorithm  to  evaluate  error  rates,  determine  sample 
sizes,  test  statistical  hypotheses,  and  construct  confidence  intervals. 

Dose-response  studies  are  routinely  conducted  in  clinical  trials  to  determine  viable  dose 
levels  for  newly  developed  therapeutic  drugs.  Due  to  safety,  efficacy,  and  experimental  design 
considerations,  practical  constraints  are  often  imposed  on  (1)  dose  range  (e.g.  restricted 
dose  range),  (2)  dose  levels  (e.g.  the  inclusion  of  placebo),  (3)  dose  numbers  (e.g.  no  more 
than  four  dose  groups),  (4)  dose  proportions  (e.g.  at  least  20  percent  of  the  subjects  will 
be  allocated  to  the  placebo)  and  (5)  potential  missing  trials.  We  propose  [32]  controlled 
optimal  designs,  that  is,  Bayesian  multiple-objective  optimal  designs  satisfying  one  or  more 
of  these  practical  constraints,  for  dose  response  studies.  The  resulting  controlled  optimal 
designs  satisfying  these  realistic  constraints  can  be  readily  adopted  by  the  pharmaceutical 
researchers  for  optimal  estimation  of  the  parameters  of  interest  such  as  the  median  effective 
dose  level  or  the  threshold  dose  level.  We  demonstrate  our  results  and  methodology  through 
the  logistic  dose  response  model  although  our  approach  is  viable  for  virtually  any  dose 
response  model. 

We  have  contributed  to  important  advances  the  efficient  allocation  of  a  simulation  bud¬ 
get.  We  have  previously  considered  the  problem  of  efficiently  allocating  simulation  replica¬ 
tions  in  order  to  maximize  the  probability  of  selecting  the  best  design  under  the  scenario 
in  which  system  performances  are  sampled  in  the  presence  of  correlation.  For  a  general 
number  of  competing  designs,  an  approximation  for  the  asymptotically  optimal  allocation  is 


obtained,  which  coincides  with  the  independent  case  derived  previously  in  the  limit  as  the 
correlation  vanishes.  An  allocation  algorithm  based  on  the  approximation  is  proposed  and 
tested  on  several  numerical  examples.  We  have  proposed  [19]  an  optimal  computing  bud¬ 
get  allocation  (OCBA)  method  to  improve  the  efficiency  of  simulation  optimization  using 
the  CE  method.  In  the  stochastic  simulation  setting  where  replications  are  expensive  but 
noise  in  the  objective  function  estimate  could  mislead  the  search  process,  the  allocation  of 
simulation  replications  can  make  a  significant  difference  in  the  performance  of  such  global 
optimization  search  algorithms.  The  OCBA  approach  proposed  here  improves  the  updating 
of  the  sampling  distribution  by  carrying  out  this  allocation  in  an  efficient  manner.  Numerical 
experiments  indicate  that  the  integration  of  OCBA  with  the  CE  method  provides  substantial 
computational  improvement . 

In  [6],  [21]  we  have  presented  a  new  sampling-based  algorithm  for  solving  stochastic 
discrete  optimization  problems.  The  algorithm  solves  the  sample  average  approximation 
(SAA)  of  the  original  problem  by  iteratively  updating  and  sampling  from  a  probability 
distribution  over  the  search  space.  We  show  that  as  the  number  of  samples  goes  to  infinity, 
the  value  returned  by  the  algorithm  converges  to  the  optimal  objective- function  value  and  the 
probability  distribution  to  a  distribution  that  concentrates  only  on  the  set  of  best  solutions 
of  the  original  problem.  We  then  extend  the  algorithm  to  solving  finite-horizon  MDPs, 
where  the  underlying  MDP  is  approximated  by  a  recursive  SAA  problem.  We  show  that  the 
estimate  of  the  recursive  sample-average-maximum  computed  by  the  extended  algorithm  at 
a  given  state  approaches  the  optimal  value  of  the  state  as  the  sample  size  per  state  per 
stage  goes  to  infinity.  The  recursive  algorithm  for  MDPs  is  then  further  extended  to  finite- 
horizon  two-person  zero-sum  Markov  games  (MGs),  providing  a  finite-iteration  bound  to 
the  equilibrium  value  of  the  induced  SAA  game  problem  and  asymptotic  convergence  to  the 
equilibrium  value  of  the  original  game.  The  time  and  space  complexities  of  the  extended 
algorithms  for  MDPs  and  MGs  are  independent  of  their  state  spaces. 

In  another  application  of  simulation  optimization,  we  have  studied  [16]  the  problem  of  de¬ 
termining  the  optimal  control  limits  of  control  charts,  which  requires  estimating  the  gradient 
of  the  expected  cost  function.  Simulation  is  a  very  general  methodology  for  estimating  the 
expected  costs,  but  for  estimating  the  gradient,  straightforward  finite  difference  estimators 
can  be  inefficient.  We  demonstrate  an  alternative  approach  based  on  smoothed  perturbation 
analysis  (SPA),  also  known  as  conditional  Monte  Carlo.  Numerical  results  and  consequent 
design  insights  are  obtained  in  determining  the  optimal  control  limits  for  EWMA  and  Bayes 
charts.  The  results  indicate  that  the  SPA  gradient  estimators  can  be  significantly  more  effi¬ 
cient  than  finite  difference  estimators,  and  that  a  simulation  approach  using  these  estimators 
provides  a  viable  alternative  to  other  numerical  solution  techniques  for  the  economic  design 
problem. 

2.2  Simulation-Based  and  Sampling  Methods  for  Markov  Decision 
Processes 

We  define  an  MDP  {Xi,i  =  0, 1,  ...,T}  on  state  space  S  and  action  space  A  (cf.  e.g., 
[1,  5]).  In  period  (stage)  i,  the  MDP  in  state  X \  e  S  takes  action  a*  G  A,  incurs  cost 
Ci(Xi,  a,i,Ui),  where  u>i  denotes  the  stochastic  element  (e.g.,  random  number),  and  then 
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transitions  according  to 


-^Q+l  ft+l  (A/,  ^i)j 

where  /j(x,  a,  •)  denotes  the  (stochastic)  transition  function  in  period  i  for  action  a  taken 
in  state  x.  For  notational  simplicity,  we  have  not  made  state  and  action  spaces  period 
dependent. 

The  objective  is  to  find  a  feedback  control  policy  n  =  {^(x)}^1,  which  is  a  sequence  of 
decision  rules  specifying  the  action  at  taken  when  in  state  x  in  period  i,  that  minimizes  an 
expected  cost  function,  usually  either  finite  horizon  total  cost,  finite  horizon  discounted  total 
cost,  infinite  horizon  average  cost,  or  infinite  horizon  discounted  total  cost.  This  proposal 
focuses  on  the  discounted  total  cost  setting,  both  finite  and  infinite  horizon,  and  we  define 
the  value  function  associated  with  a  policy  and  initial  state: 


"T— 1 

1 

V*(x)  =  E 

.  t=0 

1 

^7 

(2) 


where  a  is  the  (one-period)  discount  factor  and  T  could  be  infinite,  under  the  assumption 
that  the  limit  is  then  well  defined.  As  stated  earlier,  the  chief  context  is  the  setting  in  which 
simulation  is  required  to  generate  the  system  dynamics  (state  transitions)  and/or  period 
costs. 

We  begin  by  defining  some  familiar  quantities: 


Qi(x,a )  =  (expected)  cost-to-go  (Q-function)  in  period  i  for  action  a  taken  in  state  x 
and  optimal  actions  taken  henceforth; 

Vi(x)  =  optimal  value  function  in  period  i  for  state  x. 

Then  we  have  the  usual  Bellman  optimality  equation  [1,  38]: 

Vi(x)  =  inf  {E  [Ci(x,  a,u>i)  +  aVi+1(fi+1(x,a,LUi))]}  ,  (3) 

a 

written  here  in  two-part  form: 


Qi(x,a)  =  E  [Ci(s,  a,LVi)  +  aVi+1(fi+1(x,  a,Ui))} ,  (4) 

Vi(x)  =  inf(3j(x,a).  (5) 

a 

An  optimal  policy  in  period  i  will  be  denoted  by 

7r*  (x)  G  arg  inf  Qj  (x,  a),  i  —  0,...,T  —  l,  x  G  S.  (6) 

a 

When  the  policy  is  stationary,  the  subscript /argument  i  will  be  dropped.  In  the  infinite 
horizon  stationary  case,  (3)  takes  the  following  form: 

V(x)  =  inf  {E  [C(x,  a,u>)  +  aV(f(x ,  a,  a;))]}  ,  (7) 

a 

and  we  will  assume  there  exists  an  optimal  stationary  policy  such  that 

7 t*(x)  G  arg  inf  Q (x,  a),  x  G  S. 
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Input:  stage  i  <  T,  state  x  €  X,  Ni  >  0,  other  parameters.  (For  i  =  T,  VfiT  (x)  = 
VfiT(x)  =  0.) 

Initialization:  algorithm  parameters;  total  number  of  simulations  set  to  0. 

Loop  until  total  number  of  simulations  reaches  Nt\ 

•  Determine  an  action  a  to  simulate  next  state  via  f(x,a,u>),  oj  ~  U(0, 1). 

•  Update  the  following: 

number  of  times  action  a  has  been  sampled  Ni(x)  <—  Ni(x)  +  1, 

Q-function  estimate  based  on  C(x,a,u>)  and  Vi ,  j+1  (f(x,  a,  w)), 

the  current  optimal  action  estimate  (for  state  x  in  stage  i), 
and  other  algorithm-specific  parameters. 

Output:  ViNi(x)  based  on  Q-function  estimates  {Q^fix,  a)}. 


Figure  1:  Adaptive  multi-stage  (AMS)  sampling  framework 

Traditional  methods  of  policy  iteration,  value  iteration,  and  variants  based  on  linear  pro¬ 
gramming  all  suffer  from  the  curse  of  dimensionality.  Furthermore,  the  transition  function 
fi  is  generally  not  known  in  closed  form  (note  that  in  traditional  MDP  formulations,  it  is 
expressed  in  terms  of  explicit  transition  probabilities  assumed  given),  but  may  be  generated 
by  a  complicated  stochastic  simulation  model,  so  in  such  a  setting,  the  traditional  methods 
are  not  directly  applicable. 

2.2.1  Adaptive  Multi-stage  Sampling 

Adaptive  multi-stage  (AMS)  sampling  algorithms  [3,  4]  provide  procedures  for  accurately  and 
efficiently  estimating  the  optimal  value  function  under  the  constraint  that  there  is  a  finite 
number  of  simulation  replications  to  be  allocated  per  state  per  period.  These  algorithms 
adaptively  choose  which  action  to  sample  as  the  sampling  process  proceeds,  based  on  the 
estimates  obtained  up  to  that  point,  providing  value  function  estimators  that  converge  to 
the  true  value  asymptotically  in  the  number  of  simulation  replications  allocated  per  state. 
These  algorithms  are  targeted  at  finite-horizon  MDPs  with  large,  possibly  uncountable ,  state 
spaces  but  smaller  finite  action  spaces. 

Letting  V^'{x)  denote  the  estimate  of  the  optimal  value  function  Vfix)  based  on  Ni 
simulations  in  period  (stage)  i,  the  objective  is  to  estimate  the  optimal  value  V0(x0 )  for 
a  given  starting  state  x0.  The  approach  will  be  to  optimize  over  actions,  based  on  the 
recursive  optimality  equations  given  by  (4)  and  (5).  The  latter  involves  an  optimization  over 
the  action  space,  so  the  main  objective  of  the  approaches  is  to  adaptively  determine  which 
action  to  sample  next.  The  chosen  action  will  then  be  used  to  simulate  the  one-period  costs 
Cfix,  a,uf  j)  and  next  state  fi+ i(x,  which  are  used  to  update  the  estimate  of  Qfix,  a ) 

denoted  by  Q^(x,a),  which  in  turn  determines  the  estimate  V*1  {x).  Figure  1  provides  a 
generic  algorithm  outline  for  the  adaptive  multi-stage  sampling  framework. 

Specifically,  Qfix,  a)  is  estimated  for  each  state  x  and  action  a  e  A(x),  where  A(x)  is  the 
set  of  admissible  actions  in  state  x,  by  a  sample  mean  based  on  simulated  next  states  and 
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where  Nla(x )  is  the  number  of  times  action  a  has  been  sampled  from  state  x  in  period  (stage) 
i  (. Ni  =  Yha&A(x)  Nla(x))-i  and  the  sequence  {ojfj ,  j  =  l,.„7Nl(x)}  contains  the  correspond¬ 
ing  random  numbers  used  to  simulate  the  one-period  costs  Ci(x,  a,u;“-)  and  next  states 
fi+i(x,a,Uij).  Note  that  the  number  of  next-state  samples  depends  on  the  state  x,  action 
a,  and  stage  i. 

In  the  general  framework  that  estimates  the  Q-function  via  (8),  the  total  number  of 
sampled  (next)  states  is  0(NT )  with  N  =  maXj=0,...,T-i  iVj,  which  is  independent  of  the 
state  space  size.  One  approach  is  to  select  “optimal”  values  of  7V*(x)  for  i  =  0,  ...,T  —  1, 
a  G  A(x),  and  x  G  X,  such  that  the  expected  error  between  the  values  of  V^°(x)  and 
Vo(x)  is  minimized,  but  this  problem  would  be  difficult  to  solve.  We  have  developed  two 
algorithms.  Both  construct  a  sampled  tree  in  a  recursive  manner  to  estimate  the  optimal 
value  at  an  initial  state  and  incorporate  an  adaptive  sampling  mechanism  for  selecting  which 
action  to  sample  at  each  branch  in  the  tree.  The  upper  confidence  bound  (UCB)  sampling 
algorithm  chooses  the  next  action  based  on  the  exploration-exploitation  tradeoff  captured 
by  a  multi-armed  bandit  model,  whereas  in  the  pursuit  learning  automata  (PLA)  sampling 
algorithm,  the  action  is  sampled  from  a  probability  distribution  over  the  action  space,  where 
the  distribution  tries  to  concentrate  mass  on  ( “pursue” )  the  estimate  of  the  optimal  action. 
The  analysis  of  the  UCB  sampling  algorithm  is  given  in  terms  of  the  expected  bias  [3], 
whereas  for  the  PLA  sampling  algorithm  we  provide  a  probability  bound  [4], 


2.2.2  Population-Based  Randomized  Methods  for  MDPs 

Population-based  randomized  algorithms  for  MDPs  directly  search  the  policy  space  to  avoid 
carrying  out  an  optimization  over  the  entire  action  space  at  each  policy  iteration  step,  and 
update  a  population  of  policies  as  in  genetic  algorithms  (GAs),  using  appropriate  analogous 
operations  for  the  MDP  setting.  The  hope  is  that  a  population-based  approach  provides 
robustness,  similar  to  GAs  and  scatter  search  in  deterministic  combinatorial  optimization. 
The  literature  applying  evolutionary  algorithms  such  as  GAs  for  solving  MDPs  is  relatively 
sparse  (e.g.,  [11,  34]).  We  have  developed  new  population-based  algorithms  in  [8,  27]. 

The  first  key  feature  of  the  algorithms  in  [8,  27]  needed  to  ensure  convergence  to  an 
optimal  policy  is  an  elitist  policy  that  has  a  value  function  at  least  as  good  as  the  best  value 
function  in  the  previous  population,  i.e.,  7?  G  ITS  is  an  elitist  policy  for  A  C  Iffi  if  V7T  G  A, 

V*{x)  <  Vn(x)  Mx  G  5. 


If  nk  denotes  an  elitist  policy  for  generation  k ,  then  it  has  the  following  monotonicity  prop- 
Grtv. 

V *  (, X )  <  (x)  \/x  G  5. 


We  use  the  term  “elitist”  (single  policy)  to  distinguish  it  from  “elite”  (set  of  candidate 
solutions)  in  traditional  GAs.  The  other  key  feature  is  an  action  selection  distribution  that 
generates  mutations  of  policies  to  explore  the  policy  space.  An  action  selection  distribution 
Vx  for  state  x  G  S  is  a  probability  distribution  over  the  action  space  A.  The  monotonicity 
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Input:  population  size  n  >  1,  initial  population  Ao,  action  selection  distribution  Vx  V  x  €  S, 
other  policy  mutation  parameters. 

Initialization:  Set  iteration  count  k  =  0. 

Loop  until  Stopping  Rule  is  satisfied: 

•  Generate  an  Elitist  Policy  nk  based  on  A*,. 

•  Exploration:  Generate  (n  —  1)  policies  {ir1, . . .  ,jrn~1} 

via  mutation  operators  and  sampling  from  action  selection  distribution  Vx. 

•  Next  Population  Generation:  Afc+i={7ffe,7r1,. ..  ,7r"  1},  i=l, n  —  1. 

•  k  <—  k  +  1. 

Output:  ifk  an  estimated  optimal  policy. 


Figure  2:  Population-based  evolutionary  framework  for  MDPs. 

property  of  the  elitist  policy  and  the  exploration  property  of  the  action  selection  distribution 
ensure  that  the  algorithms  converge  to  a  population  in  which  the  elitist  policy  is  an  optimal 
policy.  A  description  of  a  general  framework  for  the  population-based  evolutionary  approach 
is  provided  in  Fig.  2,  where  A*.  C  FR  denotes  the  A:th  generation  population  of  policies  and 
n  —  |Afc|  >  1  is  the  constant  population  size. 

We  have  developed  two  algorithms:  evolutionary  policy  iteration  (EPI)  and  evolutionary 
random  policy  search  (ERPS).  These  algorithms  are  especially  targeted  to  problems  where 
the  state  space  is  relatively  small  but  the  action  space  is  extremely  large,  so  that  the  policy 
improvement  step  in  policy  iteration  (PI)  becomes  computationally  impractical.  They  elim¬ 
inate  the  operation  of  maximization  over  the  entire  action  space  in  the  policy  improvement 
step  by  directly  manipulating  policies  to  generate  an  elitist  policy.  In  earlier  work  under 
AFOSR  support,  we  developed  EPI  [8],  which  uses  a  technique  called  policy  switching  [7]  to 
generate  an  elitist  policy  from  a  set  of  given  policies,  with  a  computation  time  on  the  order 
of  the  size  of  the  state  space. 

Evolutionary  Random  Policy  Search 

Evolutionary  Random  Policy  Search  (ERPS)  is  an  enhancement  of  EPI  that  improves  upon 
both  the  elitist  policy  determination  and  the  mutation  step  by  solving  a  sequence  of  sub- 
MDP  problems  defined  on  smaller  policy  spaces.  As  in  EPI,  each  iteration  of  ERPS  has  two 
main  steps:  (i)  generating  an  elitist  policy,  but  using  policy  improvement  with  cost  swapping 
(PICS)  instead  of  policy  switching,  and  (ii)  exploring  the  policy  using  a  “nearest  neighbor” 
heuristic  along  with  sampling  of  the  entire  action  space. 

Policy  Improvement  with  Cost  Swapping 

ERPS  splits  a  potentially  large  MDP  problem  into  a  sequence  of  smaller  MDPs,  to  extract 
a  convergent  sequence  of  policies  via  solving  these  smaller  problems.  For  a  given  policy 
population  A,  if  we  restrict  the  original  MDP  (e.g.,  costs,  transition  probabilities)  to  the 
subsets  of  actions  r(x)  :=  (7r(a;)  :  tt  G  A}  \/  x  G  S,  then  a  sub-MDP  problem  is  induced  from 
A  as  Ga  '■=  (<S,  r, /,  c),  where  V  :=  lja,r(x)  C  A.  Note  that  in  general  r(x)  is  a  multi-set, 
which  means  that  the  set  may  contain  repeated  elements;  however,  we  can  always  discard 
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the  redundant  members  and  view  T(x)  as  the  set  of  admissible  actions  at  state  x. 

Given  a  nonempty  finite  subset  A  C  Ils,,  a  policy  7rpics  generated  by  policy  improvement 
with  cost  swapping  (PICS)  with  respect  to  the  sub-MDP  Q,\  is  given  by 

7Tpics(^)  e  argrnin  { E[C(x,a,Lu)\  +  aE  [VA(f(x,  a, a;))]  ),  (9) 

aGr(rr)  ^  ' 

where  VA{x)  =  min^-gA  Vn[x)  Vs  6  5.  PICS  is  a  variation  of  the  policy  improvement  step  in 
policy  iteration  performed  on  the  “swapped  cost”  VA(x)  =  min^A  V7r(x),  hence  the  name 
“policy  improvement  with  cost  swapping”.  Note  that  the  “swapped  cost”  VA(x)  may  not 
be  the  value  function  corresponding  to  any  single  policy  in  A.  However,  the  following  result 
shows  that  the  elitist  policy  generated  by  PICS  improves  any  policy  in  A. 

Theorem  2  Consider  a  nonempty  finite  subset  A  C  ns  and  the  policy  7rpics  generated  by 
PICS  with  respect  to  Q\  given  by  (9).  Then,  for  all  x  £  S,  V"*pics(x)  <  VA{x).  Furthermore, 
if  tt p^s  is  not  optimal  for  the  sub-MDP  Q\ ,  then  VKpics(x)  <  VA[x)  for  at  least  one  x  £  S. 

According  to  this  theorem,  in  each  step  of  ERPS,  the  elitist  policy  nk  generated  by 
PICS  with  respect  to  the  current  sub-MDP  Q\k,  as  given  by  (9),  improves  any  policy  in 
A*,.  Thus,  the  new  population  A^+i  contains  a  policy  that  is  superior  to  any  policy  in  the 
previous  population.  Since  i rk  is  directly  used  to  generate  the  (k  +  1  )st  sub-MDP  ,  the 
desired  monotonicity  property  follows  by  induction. 

Corollary  1  Under  ERPS,  for  all  k  >  0,  Vnk+1  (x)  <  V7rk(x)  Vi  G  5. 

The  computational  complexity  of  each  iteration  of  PICS  is  approximately  the  same  as 
that  of  policy  switching,  because  the  policy  evaluation  step  of  PICS,  which  is  also  used 
by  policy  switching,  requires  solution  of  n  systems  of  linear  equations,  and  the  number  of 
operations  required  by  using  a  direct  method  (e.g.,  Gaussian  Elimination)  is  0(n|iS|3),  and 
this  dominates  the  computational  complexity  of  the  policy  improvement  step,  which  is  at 
most  0(n|5|2). 

Exploration  via  policy  mutation  and  local  nearest  neighbor  search 

After  PICS  is  used  to  generate  an  elitist  policy,  n  —  1  other  policies  7p,  i  —  1,  ...,n  —  1,  are 
generated  by  the  following  procedure: 

With  probability  go  (exploitation) 

choose  action  nl(x)  in  neighborhood  of  i rk(x),  x  G  S,  using  “nearest  neighbor” 
heuristic. 

else  (with  probability  1  —  g0)  (exploration) 

choose  action  7rl(x)  £  A  according  to  Vx,  x  £  S. 

These  policies  are  then  combined  with  the  elitist  policy  to  form  the  next  generation. 

For  a  metric  space  A  with  metric  d(-,  •),  we  have  the  following  convergence  result. 

Theorem  3  Let  n*  be  an  optimal  policy  with  corresponding  value  function  Vn * ,  and  let  the 
sequence  of  elitist  policies  generated  by  ERPS  together  with  their  corresponding  value  func¬ 
tions  be  denoted  by  {7rfc,  k  —  1,  2, . . .}  and  { V *  ,  k  —  1,2,...},  respectively.  If  (i)  q0  <  1,  (ii) 
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the  action  selection  distribution  satisfies  for  any  I  >  0,  Vx({a\  d(a,n*(x))  <  I,  a  G  *4})  > 
0,  Vi  6  5,  and  (in)  the  transition  matrix/ function  and  one-stage  cost  function  satisfy 
Lipschitz-type  conditions,  then 

V7rk(x) — >Vn*(x)  \/ x  e  S  w.p.l. 

2.2.3  Simulation-Based  Approach  to  POMDPs 

In  a  simulation-based  approach  to  POMDPs,  we  have  developed  in  [49,  51]  a  computationally 
viable  and  theoretically  sound  method  for  solving  continuous-state  POMDPs  by  effectively 
reducing  the  dimensionality  of  the  belief  space  via  density  projections.  The  density  projection 
technique  is  also  incorporated  into  particle  filtering  (state  estimation  using  simulations) 
to  provide  a  filtering  scheme  for  online  decision-making.  We  have  proved  an  error  bound 
between  the  value  function  induced  by  the  policy  obtained  by  our  method  and  the  true  value 
function  of  the  POMDP,  and  also  an  error  bound  between  the  projection  particle  filter  and 
the  optimal  filter.  Finally,  we  have  illustrated  the  effectiveness  of  our  method  through  an 
inventory  control  problem. 

3  Additional  Research  Progress 

We  have  also  made  significant  progress  in  the  following  areas: 

•  Perturbation  analysis  of  a  dynamic  priority  call  center  ([10]); 

•  Conditional  Monte  Carlo  estimation  of  quantile  sensitivites  ([36]); 

•  Invited  papers  on  simulation  optimization  and  its  applications  [15],  [35],  [14],  [9].. 

4  Research  Output 

4.1  Journal  Publications 

•  H.S.  Chang,  M.C.  Fu,  J.  Hu,  and  S.I.  Marcus.  A  Survey  of  Some  Simulation-based 
Methods  in  Markov  Decision  Processes,  Communications  in  Information  and  Systems, 
7,  59-92,  2007. 

•  H.S.  Chang,  M.C.  Fu,  J.  Hu,  and  S.I.  Marcus.  Recursive  Learning  Automata  Approach 
to  Markov  Decision  Processes,  IEEE  Transactions  on  Automatic  Control,  52,  1349- 
1355,  2007. 

•  H.S.  Chang,  J.  Hu,  M.C.  Fu,  and  S.I.  Marcus,  Adaptive  Adversarial  Multi-Armed 
Bandit  Approach  to  Two-Person  Zero-Sum  Markov  Games,  IEEE  Transactions  on 
Automatic  Control,  forthcoming,  2010. 

•  M.  Chen,  J.Q.  Hu,  and  M.C.  Fu.  Perturbation  Analysis  of  a  Dynamic  Priority  Call 
Center,  IEEE  Transactions  on  Automatic  Control ,  forthcoming,  2010. 
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•  C.H.  Chen,  D.  He,  M.C.  Fu,  and  L.H.  Lee.  Efficient  Simulation  Budget  Allocation  for 
Selecting  an  Optimal  Subset,  INFORMS  Journal  on  Computing ,  Vol.20,  No. 4,  579-595, 
2008. 

•  M.C.  Fu.  What  You  Should  Know  About  Simulation  and  Derivatives  (Cover  Story), 
Naval  Research  Logistics,  Vol.55,  No. 8,  723-736,  2008. 

•  M.C.  Fu,  L.J.  Hong  and  J.Q.  Hu.  Conditional  Monte  Carlo  Estimation  of  Quantile 
Sensitivities,  Management  Science,  Vol.55,  No.  12,  2019-2027,  2009. 

•  M.C.  Fu,  S.  Lclc,  and  T.  Vossen.  Conditional  Monte  Carlo  Gradient  Estimation  in 
Economic  Design  of  Control  Limits,  Production  &  Operations  Management,  Vol.  18, 
No.  1,  60-77,  2009. 

•  D.  He,  L.H.  Lee,  C.H.  Chen,  M.C.  Fu,  and  S.  Wasserkrug.  Simulation  Optimization 
Using  the  Cross-Entropy  Method  with  Optimal  Computing  Budget  Allocation,  ACM 
Transactions  on  Modeling  and  Computer  Simulation,  in  press. 

•  J.W.  Heath,  M.C.  Fu,  and  W.  Jank.  New  Global  Optimization  Algorithms  for  Model- 
Based  Clustering,  Computational  Statistics  and  Data  Analysis,  in  press. 

•  J.  Hu,  M.C.  Fu,  V.  Ramezani,  and  S.I.  Marcus.  An  Evolutionary  Random  Policy  Search 
Algorithm  for  Solving  Markov  Decision  Processes,  INFORMS  Journal  on  Computinq, 
19,  161-174,  2007. 

•  J.  Hu,  M.C.  Fu,  and  S.I.  Marcus.  A  Model  Reference  Adaptive  Search  Algorithm  for 
Global  Optimization,  Operations  Research,  55,  549-568,  2007. 

•  J.  Hu  and  Z.  Su.  Efficient  Error  Determination  in  Sequential  Clinical  Trial  Design, 
Journal  of  Computational  and  Graphical  Statistics,  17,  925-948,  2008. 

•  J.  Hu  and  Z.  Su.  Adaptive  Resampling  Algorithms  for  Estimating  Bootstrap  Distri¬ 
butions,  Journal  of  Statistical  Planning  and  Inference,  138,  1763-1777,  2008. 

•  J.  Hu  and  Z.  Su.  Bootstrap  Quantile  Estimation  via  Importance  Resampling,  Compu¬ 
tational  Statistics  and  Data  Analysis,  52,  5136-5142,  2008. 

•  J.  Hu,  M.C.  Fu,  and  S.  I.  Marcus.  A  Model  Reference  Adaptive  Search  Method  for 
Stochastic  Global  Optimization,  Communications  in  Information  and  Systems,  8,  245- 
276,  2009. 

•  J.  Hu,  H.S.  Chang,  M.C.  Fu,  and  S.I.  Marcus.  Dynamic  Sample  Budget  Allocation  in 
Model-Based  Optimization,  Journal  of  Global  Optimization,  forthcoming,  2010. 

•  J.  Hu,  W.  Zhu,  Y.  Su,  and  W.  K.  Wong.  Controlled  Optimal  Design  Program  for  the 
Logit  Dose  Response  Model,  Journal  of  Statistical  Software,  forthcoming,  2010. 

•  H.G.  Lee,  A.  Arapostathis,  and  S.I.  Marcus,  Necessary  and  Sufficient  Conditions 
for  State  Equivalence  to  a  Nonlinear  Discrete-Time  Observer  Canonical  Form,  IEEE 
Transactions  on  Automatic  Control,  53,  2701-2707,  2008. 
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•  Z.  Su,  J.  Hu,  and  W.  Zhu.  “Multi-Step  Variance  Minimization  in  Sequential  Tests,” 
Statistics  and  Computing,  18,  101-108,  2008. 

•  E.  Zhou,  M.C.  Fu,  and  S.I.  Marcus.  Projection  Particle  Filtering  for  Dimension  Re¬ 
duction  in  Continuous-time  POMDPs,  IEEE  Transaction  on  Automatic  Control,  forth¬ 
coming,  2010. 

4.2  Refereed  Proceedings  or  Book  Chapters 

•  H.  S.  Chang,  M.  C.  Fu,  and  S.  I.  Marcus.  Adversarial  Multi-Armed  Bandit  Approach 
to  Two-Person  Zero-Sum  Markov  Games,  Proceedings  of  the  46th  IEEE  Conference  on 
Decision  and  Control,  December  2007. 

•  C.H.  Chen,  M.C.  Fu,  and  L.  Shi.  Simulation  and  Optimization,  Tutorials  in  Operations 
Research,  Z.L.  Chen  and  S.  Raghavan,  editors,  INFORMS,  247-260,  2008. 

•  M.C.  Fu.  Variance-Gamma  and  Monte  Carlo,  Advances  in  Mathematical  Finance, 
M.C.  Fu,  R.A.  Jarrow,  J.-Y.  Yen,  and  R.J.  Elliott,  editors,  Birkhauser,  21-35,  2007. 

•  M.C.  Fu,  C.H.  Chen,  and  L.  Shi.  Some  Topics  in  Simulation  Optimization,  Proceedings 
of  the  2008  Winter  Simulation  Conference,  Miami,  FL,  Dec.  7-10,  2008. 

•  J.  Hu,  M.  C.  Fu,  and  S.  1.  Marcus.  A  Model  Reference  Adaptive  Search  Method  for 
Stochastic  Optimization  with  Applications  to  Markov  Decision  Processes,  Proceedings 
of  the  46th  IEEE  Conference  on  Decision  and  Control,  December  2007. 

•  J.  Hu  and  H.  S.  Chang.  A  Population-Based  Cross-Entropy  Method  with  Dynamic 
Sample  Allocation,  Proceedings  of  47th  IEEE  Conference  on  Decision  and  Control, 
2426-2431,  2008. 

•  J.  Hu  and  P.  Hu.  On  the  Performance  of  the  Cross-Entropy  Method,  Proceedings  of 
the  2009  Winter  Simulation  Conference,  459-468,  2009. 

•  U.  Kuter  and  J.  Hu.  Computing  and  Using  Lower  and  LIpper  Bounds  for  Action  Elimi¬ 
nation  in  MDP  Planning,  Proceedings  of  the  7th  Symposium  on  Abstraction,  Reformu¬ 
lation  and  Approximation  (SARA-07),  Springer  Lecture  Notes  in  Computer  Science 
(4612),  243-257,  2007. 

•  A.  Rawat,  H.  La,  M.  Shayman,  and  S.I.  Marcus.  Minimum  Wavelength  Assignment  for 
Multicast  Traffic  in  All-Optical  WDM  Tree  Networks,  Proceedings  of  the  5th  Interna¬ 
tional  Conference  on  Broadband  Communications,  Networks,  and  Systems  (BROAD- 
NETS  2008),  London,  UK,  September  8-11,  2008. 

•  Y.  Wang,  M.C.  Fu,  and  S.I.  Marcus.  A  New  Stochastic  Gradient  Estimator  for  Amer¬ 
ican  Option  Pricing,  Proceedings  of  the  European  Control  Conference,  Budapest,  Hun¬ 
gary,  August  23-26,  2009. 

•  Y.  Wang,  M.C.  Fu,  and  S.I.  Marcus.  Sensitivity  Analysis  for  Barrier  Options,  Proceed¬ 
ings  of  the  2009  Winter  Simulation  Conference,  Austin,  TX,  Dec.  13-16,  2009. 
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•  Y.  Wang,  M.C.  Fu,  and  S.I.  Marcus.  Dynamic  Pricing  with  Continuous  Stochastic 
Demand,  Proceedings  of  the  48th  IEEE  Conference  on  Decision  and  Control,  Shanghai, 
China,  Dec.  16-18,  2009. 

•  E.  Zhou,  M.C.  Fu,  and  S.I.  Marcus.  Solving  Continuous-State  POMDPs  via  Projection 
Particle  Filtering,  Proceedings  of  Eighth  International  Conference  on  Monte  Carlo  and 
Quasi-Monte  Carlo  Methods  in  Scientific  Computing,  Montreal,  Canada,  July  6-11, 
2008. 

•  E.  Zhou,  M.C.  Fu,  and  S.I.  Marcus.  A  Particle  Filtering  Framework  for  Random¬ 
ized  Optimization  Algorithms,  Proceedings  of  the  2008  Winter  Simulation  Conference, 
Miami,  FL,  Dec.  7-10,  2008. 

•  E.  Zhou,  M.C.  Fu,  and  S.I.  Marcus.  A  Density  Projection  Approach  to  Dimension 
Reduction  for  Continuous-State  POMDPs,  Proceedings  of  the  IEEE  Conference 
on  Decision  and  Control,  Cancun,  Mexico,  Dec.  9-11,  2008. 

•  E.  Zhou,  M.C.  Fu,  and  S.I.  Marcus.  A  Numerical  Method  for  Financial  Decision 
Problems  Under  Stochastic  Volatility,  Proceedings  of  the  2009  Winter  Simulation 
Conference,  Austin,  TX,  Dec.  13-16,  2009. 

4.3  Authored  Books  or  Monographs 

•  H.S.  Chang,  M.C.  Fu,  J.  Hu,  and  S.I.  Marcus.  Simulation-based  Algorithms  for  Markov 
Decision  Processes,  Springer- Verlag,  2007  (research  monograph). 

4.4  Edited  Volumes 

•  M.C.  Fu,  R.A.  Jarrow,  J.-Y.  Yen,  and  R.  J.  Elliott,  editors,  Advances  in  Mathematical 
Finance,  Birkhauser,  2007. 

4.5  Awards 

•  Michael  Fu:  Elected  Fellow  of  the  Institute  of  Electrical  and  Electronics  Engineers 
(IEEE). 

•  Michael  Fu:  Elected  Fellow  of  the  Institute  for  Operations  Research  and  the  Manage¬ 
ment  Sciences  (INFORMS). 

•  Steve  Marcus:  Elected  Fellow  of  the  Society  for  Industrial  and  Applied  Mathematics 
(SIAM). 

•  The  paper  “A  Numerical  Method  for  Financial  Decision  Problems  under  Stochastic 
Volatility,”  by  Enlu  Zhou,  Kun  Lin,  Michael  Fu,  and  Steve  Marcus,  won  the  Best 
Theoretical  Paper  Award  at  the  2009  Winter  Simulation  Conference  (WSC),  December 
13-16,  in  Austin,  Texas. 
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4.6  Ph.D.  Students 

•  Pedram  Fard,  Ph.D.,  2007,  Univ.  of  Maryland,  supervised  by  S.  Marcus,  M.  Shay- 
man,  and  H.  La,  “Dynamic  Configuration  of  Network  Topology  in  Optical  Networks” 
(http://hdl.handle.net/1903/7412)  (currently:  Global  Protocols) 

•  Huiju  Zhang,  Ph.D.,  2007,  Univ.  of  Maryland,  supervised  by  M.  Fu,  “Three  essays  on 
stochastic  optimization  applied  in  financial  engineering  and  inventory  management,” 
(http://hdl.handle.net/1903/6739)  (currently:  Fitch  Ratings) 

•  Jeffrey  Heath,  Ph.D.,  2007,  Univ.  of  Maryland,  supervised  by  M.  Fu,  “Global  optimiza¬ 
tion  of  finite  mixture  models,”  (http://hdl.handle.net/1903/7179)  (currently:  Centre 
College) 

•  Scott  Nestler,  Ph.D.,  2007,  Univ.  of  Maryland,  supervised  by  M.  Fu,  “Empirical  analy¬ 
ses  on  federal  thrift  savings  plan  portfolio  optimization,”  (http: / /hell. handle. net/ 1903/7749) 
(currently:  US  Military  Academy,  West  Point) 

•  Andrew  Hall,  Ph.D.,  2009,  Univ.  of  Maryland,  supervised  by  M.  Fu,  “Simulating  and 
Optimizing:  Military  Manpower  Modeling  and  Mountain  Range  Options”  (currently: 

US  Military  Academy,  West  Point) 

•  Matthew  Reindorp,  Ph.D.,  2009,  Univ.  of  Maryland,  supervised  by  M.  Fu,  “Industrial 
Flexibility  in  Theory  and  Practice”  (currently:  Technical  LIniversity  of  Eindhoven) 

•  Abraham  Thomas,  Ph.D.,  2009,  LIniv.  of  Maryland,  supervised  by  S.  Marcus,  “Learn¬ 
ing  Algorithms  for  Markov  Decision  Processes.” 

•  Enlu  Zhou,  Ph.D.,  2009,  LIniv.  of  Maryland,  supervised  by  S.  Marcus  and  M.  Fu, 
“Particle  Filtering  for  Stochastic  Control  and  Global  Optimization”  (currently:  LIniv. 
of  Illinois  Urbana-Champaign) 

•  Ping  Hu,  Ph.D  expected  2011,  Stony  Brook,  supervised  by  J.  Hu 

•  Yongqiang  Wang,  Ph.D  expected  2011,  LIniv.  of  Maryland,  supervised  by  M.  Fu  and 
S.  Marcus 

•  Ranit  Sengupta,  Ph.D  expected  2012,  LIniv.  of  Maryland,  supervised  by  M.  Fu  and  S. 
Marcus 

•  Kun  Lin,  Ph.D  expected  2012,  LIniv.  of  Maryland,  supervised  by  M.  Fu  and  S.  Marcus 
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